Plot - 1 : Heatmap

The heatmap shows the pairwise correlation between features. It reveals strong correlations between the following pairs:

debt_to_income and annual_income_joint

open_credit_lines and num_satisfactory_accounts

current_accounts_delinq and num_accounts_30d_past_due
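As a rough illustration of how such pairs can be picked out of a correlation matrix, the sketch below computes the Pearson coefficient for every pair of columns and keeps those above a cut-off. The helper names, the 0.8 threshold and the toy values are assumptions made for this example and are not taken from the analysis itself. With pandas, `DataFrame.corr()` would produce the full matrix directly.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold.

    `threshold` is an assumed cut-off, not a value from the write-up.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up toy values, NOT taken from the loan dataset.
toy = {
    "open_credit_lines": [5, 8, 12, 3, 9],
    "num_satisfactory_accounts": [5, 8, 11, 3, 9],
    "interest_rate": [11.0, 6.5, 9.9, 14.2, 7.1],
}
print(strong_pairs(toy))
```

On the toy values only the first two columns pass the cut-off, which mirrors how the tightly related pairs stand out in the heatmap.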

Plot - 2 : Loan exposure by state

This chart shows the number of loans taken in each US state. Hovering over a state shows the count of loans in that state.

Plot - 3 : Loan Status Distribution

The loan status takes one of 6 values. The pie chart below shows the percentage share of each status. As Lending Club, we are likely to be most interested in the percentage of defaulters, which can be read off by selecting only the two 'Late' categories in the legend.
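The same figure can be computed without the chart. The sketch below is a minimal example that assumes the two late statuses are the ones whose label starts with "Late". The exact label strings and the toy counts are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def status_shares(statuses):
    """Percentage share of each loan status in the list."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {status: 100.0 * count / total for status, count in counts.items()}


def late_share(statuses):
    """Combined percentage of all statuses whose label starts with 'Late'.

    The 'Late' prefix rule is an assumption about how the labels are spelled.
    """
    shares = status_shares(statuses)
    return sum(pct for status, pct in shares.items() if status.startswith("Late"))


# Hypothetical labels and counts, for illustration only.
toy_statuses = (
    ["Current"] * 90
    + ["Fully Paid"] * 6
    + ["Late (31-120 days)"] * 3
    + ["Late (16-30 days)"] * 1
)
print(late_share(toy_statuses))
```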

Plot - 4 : Risk Level Analysis

Loan applicants are classified into 7 grades (A to G). Grade G is the highest-risk category and pays the most interest, while Grade A is the lowest-risk category. Each of the 7 grades is further divided into 5 sub-grades. The stacked chart shows each grade split into its sub-grades, based on the available dataset.
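The counts behind such a stacked chart can be sketched as a nested tally. This example assumes sub-grades are written as the grade letter followed by a digit (for example "A1"), so that the first character identifies the grade. The function name and the toy list are made up for illustration.

```python
from collections import defaultdict


def subgrade_counts(sub_grades):
    """Tally loans per sub-grade, grouped under their parent grade.

    Assumes each label looks like 'A1': grade letter first, then the sub-grade digit.
    """
    table = defaultdict(lambda: defaultdict(int))
    for sub_grade in sub_grades:
        grade = sub_grade[0]  # first character is taken to be the grade
        table[grade][sub_grade] += 1
    # Sort grades and sub-grades so the stack order is stable.
    return {g: dict(sorted(subs.items())) for g, subs in sorted(table.items())}


# Made-up labels, NOT taken from the loan dataset.
toy_sub_grades = ["A1", "A1", "A3", "B2", "G5"]
print(subgrade_counts(toy_sub_grades))
```

Each inner dictionary corresponds to one bar of the stack, with one segment per sub-grade.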

Plot - 5 : Payment Failure-Interest Rate Analysis

The dataset is split into No-Risk, Low, Medium and High risk categories based on the column "num_historical_failed_to_pay". Applicants are classified as 'Low' risk when they have failed to repay a loan only a small number of times in the past.

The chart shows that the interest rate stays within a lower range for the No-Risk population, whereas it varies more for the other three risk categories.
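The bucketing and the per-category rate range can be sketched as follows. The write-up does not state the cut-offs between Low, Medium and High, so `low_max` and `medium_max` here are placeholder assumptions, as are the helper names and the toy rows.

```python
def risk_category(num_failed, low_max=1, medium_max=3):
    """Map a value of num_historical_failed_to_pay to a risk label.

    low_max and medium_max are ASSUMED thresholds, not the ones used in the analysis.
    """
    if num_failed < 0:
        raise ValueError("number of failed payments cannot be negative")
    if num_failed == 0:
        return "No-Risk"
    if num_failed <= low_max:
        return "Low"
    if num_failed <= medium_max:
        return "Medium"
    return "High"


def rate_range_by_category(rows):
    """Return {category: (min_rate, max_rate)} for (num_failed, interest_rate) rows."""
    ranges = {}
    for num_failed, rate in rows:
        category = risk_category(num_failed)
        lowest, highest = ranges.get(category, (rate, rate))
        ranges[category] = (min(lowest, rate), max(highest, rate))
    return ranges


# Made-up rows of (num_historical_failed_to_pay, interest_rate).
toy_rows = [(0, 6.5), (0, 7.2), (1, 9.0), (1, 15.5), (2, 11.0), (5, 8.0), (6, 20.0)]
print(rate_range_by_category(toy_rows))
```

Comparing the width of each (min, max) pair gives a numeric counterpart to the spread visible in the chart.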

Model - 1 : LinearRegression

The response variable to be predicted is the interest rate, which is a continuous variable. A Linear Regression model is therefore a suitable choice.

Our dataset contains a few categorical variables that must be converted to numbers before building a Linear Regression model. We can use two methods to do this:

(1) Factorizing the categorical variable

(2) One-hot encoding method
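To make the difference between the two methods concrete, here is a minimal pure-Python sketch of both. The function names and the toy "homeownership" values are hypothetical. In pandas, `pandas.factorize` and `pandas.get_dummies` provide the corresponding operations.

```python
def factorize(values):
    """Replace each category with an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, list(mapping)


def one_hot(values):
    """Replace each category with a 0/1 indicator row, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return rows, categories


# Hypothetical categorical column, for illustration only.
homeownership = ["RENT", "MORTGAGE", "OWN", "RENT"]

codes, seen = factorize(homeownership)
print(codes, seen)

rows, columns = one_hot(homeownership)
print(columns)
for row in rows:
    print(row)
```

Factorizing yields a single numeric column, so a linear model treats the arbitrary codes as ordered magnitudes. One-hot encoding avoids that implied order at the cost of one extra column per category.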

The accuracy of the model increases from 74.19% to 99.97% when the one-hot encoding method is used.
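For a regression model, a percentage like this is commonly the coefficient of determination (R squared), which is what scikit-learn regressors return from their `score` method. Whether the figures above were obtained that way is an assumption. A minimal sketch of the metric, with made-up values:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up interest rates, NOT taken from the loan dataset.
actual = [10.0, 12.0, 14.0, 16.0]
print(r_squared(actual, [10.0, 12.0, 14.0, 16.0]))  # perfect predictions
print(r_squared(actual, [13.0, 13.0, 13.0, 13.0]))  # always predicting the mean
```

Perfect predictions give a score of 1, and always predicting the mean gives 0, so a value near 1 means the model explains almost all of the variation in the interest rate.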

Model - 2 : RandomForestRegressor

As one-hot encoding is the better approach, we continue to use the same train and test datasets for our second model. The predictions for the train and test datasets align along a straight line, which shows the goodness of fit of the model.

Result Visualisation:

The results of the RandomForest model are shown below:

Given more time, I would have analysed the data in more depth, built better models and experimented with different approaches. I could also have improved the coefficients and used boosting methods. A more in-depth analysis of the data would help me provide more informative visualizations.